An array processor, also called an attached processor, is a high
speed special purpose computational device, which can perform
repetitive computations on well structured data sets at effective
speeds far beyond those achieved by minis and superminis. Although
their speeds are substantially below the newest generations of large
scale pipeline and parallel vector computers, they do provide a clear
cost/performance advantage. Furthermore, the combination of a
minicomputer and an array processor provides a high degree of
interactivity in addition to attractive computational speeds.

Using an array processor/minicomputer combination effectively is a
difficult task, requiring not only knowledge of pertinent algorithms
and computer programming, but also a clear understanding of the
hardware and the details of its operation. The objective is to produce
efficient implementations of practical algorithms. These
implementations, however, should eventually be user transparent, in
order to be of benefit to the engineering community.

The array processor/minicomputer combination is a special case of a
multiprocessor system in which only two processors, with widely
differing characteristics, are involved. One interesting factor here
is that the communication between the two processors leaves a
substantial amount of control in the hands of the programmer.
Experimentation with such a system provides an insight into the more
general area of parallel processing.

The issues here relate to the synchronization of computations and
data transfers, as well as the design of operating systems and
applications programs to handle the solution of complex problems.

Several operations typical of finite element analysis were chosen as
a basis for performance measurements. Some experiments measured the
speeds of execution on the host computer and the array processor,
acting as separate processors. Other measurements were designed to
assess the performance of the integrated system. In addition to these
measurements, simulation was used to predict the performance. The
simulation results are matched to the measured results in order to
gain confidence in the simulation procedures. Once this has been
accomplished, the same methods are used to predict the performance
more generally.

Measurements were taken for simple matrix operations, numerical
integration of solid element properties, and the decomposition of a
hypermatrix. In the case of the computation of stiffness matrices for
the solid element, several alternative strategies were attempted, some
of which were designed to minimize data transfers between the host and
the array processor.

Finally, the experience gained so far is used to examine the
possibility of implementing several algorithms typical of [...] the
effectiveness of the system in execution of such computations is made.

SUMMARY

The paper describes ongoing efforts to investigate the usefulness of
so called "attached processors" or "array processors", combined with mini-
and superminicomputers, in the analysis of computationally demanding
engineering problems, such as finite element nonlinear applications. A
close examination of some basic algorithms, which play an important part
in such analysis, is presented. Furthermore, a methodology is developed,
which forms a useful guide to further efforts of this nature, including
the more general case of multiprocessing. Emphasis is placed on the
ability to predict the performance of algorithms from time measurements of
certain basic operations, coupled with an understanding of the
characteristics of the hardware in question and the interplay between its
various components. Four hardware alternatives are considered, two of
which have attached array processors. Of the algorithms considered, it
appears that the decomposition process is the one which benefits most from
the presence of an array processor. The speed advantage gained in
stiffness assembly, and forward and backward substitution may both be
obtained by purchasing a more powerful superminicomputer.


INTRODUCTION 

The advent of "super computers", such as the STAR and CRAY series,
and the HEP computer [1], provides an undoubted breakthrough for large
scale finite element analysis [2]. The availability of inexpensive
microprocessors has triggered efforts to devise microcomputer networks,
using parallel and distributed processing to provide faster solution of
finite element problems [3,4]. Finite element software is moving into a
new era, in which the proper programming environment and modular program
systems [5] are replacing monstrously large programs.

In contrast to efforts based on high performance sixth generation
computers [2,6], this paper describes ongoing efforts [7] to investigate
the usefulness of so called "attached processors" or "array processors",
combined with mini- and superminicomputers, in the analysis of
computationally demanding engineering problems, such as finite element
nonlinear applications. The work presented here does not include actual
solutions of such problems. Rather, a close examination of some basic
algorithms, which play an important part in nonlinear analysis, is
presented. Furthermore, a methodology is developed, which forms a useful
guide to further efforts of this nature, including the more general case
of distributed processing. Emphasis is placed on the ability to predict
the performance of algorithms from time measurements of certain basic
operations, coupled with an understanding of the characteristics of the
hardware in question and the interplay between its various components.

The array processor/minicomputer combination is a special case of a
multiprocessor system in which only two processors, with widely differing
characteristics, are present. Experimentation with such a system provides
an insight into the more general area of parallel processing. The issues
here relate to the synchronization and balancing of computations and data
transfers, as well as the design of operating systems and applications
programs to handle the solution of complex problems [8,9,10].

Measurements taken on an array processor/minicomputer system show
that, while the array processor is perhaps 80 times faster than the host
16 bit minicomputer in the performance of basic matrix arithmetic, it is
often not possible to apply this computational power to practical
problems, due to the various restrictions imposed by the limited user
address space. An alternate system is now being put together in which a 32
bit minicomputer, with a large address space and a faster data transfer
rate, is used as the host.

Several algorithms typical of finite element analysis were chosen as
a basis for performance measurements. Some experiments measured the speeds
of execution on the host computer and the array processor acting as
separate processors. Other measurements were designed to assess the
[...] problems may appear competitive, one should not forget that severe
restrictions are placed on problem size due to address limitations,
rendering these figures purely hypothetical, except in some dedicated
systems.


NOMENCLATURE 

It is useful to introduce some symbols and abbreviations at this
point, in order to describe alternate computer configurations and refer
to problem parameters. These abbreviations are:

HC16      Refers to the 16-bit minicomputer used independently or in
          conjunction with an array processor. It is characterized by a
          limited user address space (typically 64 KB) and a slow data
          bus (typical speed up to 1.5 MB/sec).

HC32      Alternate 32-bit superminicomputer used independently or with
          an array processor. Since the array processor is assumed to
          provide the primary computational tool, the selected HC32
          system has the advantages of a 32-bit minicomputer, namely a
          large address space (16 MB) and a fast data bus (speed
          2[?].5 MB/sec). On the other hand, its floating point
          processing speed is somewhat limited (0.435 MFLOPS in
          Whetstone tests, as opposed to the HC16 figure of 0.190).
          Today's popular superminis have been measured at 1.2 MFLOPS,
          and the top of the line of our HC32 series has been measured
          at 3.5 MFLOPS.

AP        Stands for array processor. The one used in this research is
          primarily a double precision device, capable of performing 64
          bit vector arithmetic, and other well structured
          computations, at speeds well above those of mini- and
          superminicomputers. So far, however, it has been used in the
          single precision (32-bit) mode, due to address space
          limitations. A single precision variant of the same array
          processor performs the same functions faster and is
          considerably less expensive.

HC16+AP   Combination of the HC16 and AP. The devices communicate via a
          slow interface, using the HC16 low speed data bus
          extensively.

HC32+AP   Combination of HC32 and AP, which are coupled via a high
          speed direct memory interface.

I'        Total number of unknowns in the problem being solved.

b         Half bandwidth of problem. Based on element connectivity.

N         Submatrix size used to partition the system of equations.

T         Total number of partitions in the stiffness matrix.

H         Number of non-zero submatrices, per row, in stiffness matrix.

M         Number of longitudinal stations in the solid model being
          analyzed.

N         Number of nodes across the solid model in either direction.

K         Total number of 8-noded isoparametric solid elements used to
          model the solids problem.


HARDWARE CHARACTERISTICS

An array processor, also called an attached processor, is a high
speed special purpose computational device, which can perform repetitive
computations on well structured data sets at effective speeds far beyond
those achieved by minis and superminis. Although their speeds are
substantially below the newest generation of large scale pipeline and
parallel vector computers, they do provide a clear cost/performance
advantage. Furthermore, the combination of a minicomputer and an array
processor offers a high degree of interactivity in addition to attractive
computational speeds.


The array processor is a separate computer, running under an
[...] involving extensive decision making. Data transfers and message
swapping between the host and the array processor are problematic and can
affect performance appreciably, as seen from some of the measurements
taken. The amount of AP local memory may be limited in certain instances.
This, however, does not apply to the system under consideration.

Several AP types are available today. We restrict the discussion to
the specific device used in this research which, we believe, represents
the state of the art. In order to understand the operation of the system,
Fig. 1 explains the two basic components, and the various software items
which control them, as well as the location of the data. The operating
system of the host computer controls the operation of the host, including
the user program, which specifies the sequence of events necessary to
perform the computation. A special piece of software, called the DRIVER,
resides in the host computer. It facilitates communication with the array
processor by translating user FORTRAN calls into small packages called
Function Control Blocks or FCBs, which are shipped to the array processor
to initiate specific computations or data transfers on the latter device.
The format of a typical FCB, which occupies six 16-bit words (HC16) or
three 32-bit words (HC32), is given in Fig. 2. It contains a function code
and operand addresses, which take the form of array processor buffer
numbers, as well as some control variables (checksum, etc.).

On the array processor side, an independent operating system, called
the EXEC, resides. Its function is to control the operation of the array
processor itself. In the array processor there exists a mathematical
library, consisting of a number of unlinked (relocatable) routines, which
may be assembled into any program being executed inside the array
processor. One of the functions of EXEC is to link such programs, as
needed, to the specifications contained in the FCBs, and control their
execution within the array processor. Such programs take the form of two
separate, but fully synchronized, programs called the APU and APS programs
respectively. Data are stored within the array processor, while being
operated upon, in special areas called buffers. One important consideration
for the system at hand is that, in the FCB, a restricted number of bits is
used to store the buffer number. This poses a restriction on the number of
buffers which may be defined. At the moment, the maximum number of buffers
is 64. Soon it will be expanded to 512. The hardware configurations for
the HC16+AP and the HC32+AP are given in Fig. 3 and Fig. 4 respectively.
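The link between the width of the buffer-number field and the buffer limit can be made concrete with a short calculation. The sketch below simply assumes that buffers are numbered from zero in a plain binary field; the text does not state the field width itself, so the widths printed are inferred, not quoted.

```python
def bits_for_buffers(n_buffers):
    """Smallest binary field width that can number n_buffers buffers from 0."""
    return max(1, (n_buffers - 1).bit_length())


# Current limit and planned limit quoted in the text.
print(bits_for_buffers(64), bits_for_buffers(512))
```

Under that assumption the planned expansion from 64 to 512 buffers corresponds to three additional bits in the field.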

BASIC MEASUREMENTS

In attempting to assess the effectiveness of a particular hardware
system, such as the one being considered here, several approaches are
possible. One approach is to run specific problems on an existing system.
[...] such operations, in both host computers and in the AP, have been
obtained. Figures 6 and 7 show time estimates with a comparison of the
relative speeds of host computer and AP. It has been observed that speed
ratios up to 80 are possible for large matrices. Examples of basic task
times for a 40X40 matrix are given in Table 1.


STIFFNESS  COMPUTATION  AND  ASSEMBLY 


The process of stiffness assembly is a basic and important one for
both linear and nonlinear analysis. Here, two basic functions are
performed, element stiffness computation and assembly into the master
stiffness matrix. For the purpose of this investigation, the solid model
shown in Fig. 8 was chosen. By varying the number of stations (M) and the
number of nodes in each direction of the cross-section (N), different
combinations of problem size and bandwidth are produced, which form the
basis for a parametric study. One important assumption made was that the
core memory must be large enough to contain the shaded area of the
stiffness matrix, shown in Fig. 8. In this way, the submatrices of the
stiffness hypermatrix are generated in core and written to the disc only
once.
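A small sketch shows how the two model parameters drive problem size. It assumes three displacement unknowns per node and no eliminated boundary unknowns, which is an assumption of this example, although it does reproduce the 8640 unknowns that follow for the (20X12X12) problem quoted in this section. The partition count assumes the equations are cut into equal blocks of the 40X40 submatrix size used in Table 1; the function names are hypothetical.

```python
import math


def solid_model_unknowns(stations, nodes_per_side, dof_per_node=3):
    """Unknowns for M stations, each carrying an N x N grid of nodes."""
    return stations * nodes_per_side ** 2 * dof_per_node


def partition_count(unknowns, submatrix_size):
    """Number of submatrix-sized blocks along the diagonal (assumed equal)."""
    return math.ceil(unknowns / submatrix_size)


unknowns = solid_model_unknowns(stations=20, nodes_per_side=12)
print(unknowns)                       # 8640
print(partition_count(unknowns, 40))  # 216
```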

At this point, it is appropriate to bring up the issue of problem
size limitations, which apply here equally for stiffness assembly and
decomposition. Fig. 9 illustrates this point. Several factors limit
problem size, and these constraints are represented by straight lines or
curves in the figure. For example, the effect of the limited number of
buffers is shown by vertical lines. At present, the system has a
[...] AP is currently equal to 16 KW, minus 11 KW needed for EXEC and the
mathematical library. Since in the current decomposition routine
submatrices are placed in this memory, a submatrix size of 30X30 cannot
be exceeded. If the memory is expanded to 32 KW, it is possible to employ
62X62 matrices; and if extended to 64 KW, matrices over 100X100 may be
handled. With the total space available on memories 2 and 3, a maximum
half bandwidth b, of approximately [?]20, can be handled. This is
illustrated by the first of a set of curves showing the possible
combinations of S and H which may be handled. If the two memories are

[...] during the assembly process. The time required for these transfers
was also computed and included in the estimate. Results from various runs
show that the stiffness computation process is speeded up by a constant
factor of approximately 5 for the HC16, and 3.5 for HC32, regardless of
bandwidth, by adding the array processor. This result is somewhat
disappointing perhaps, but similar conclusions have been reached by others
[2]. Table 2 gives the operations count and timing for a realistic
(20X12X12) problem where the number of unknowns is 8640 and the half
bandwidth is 900.

[...] program and the optimum algorithm, some differences in the counts
were observed. In order to remove the effect of the discrepancy, a
correction was applied, and both the corrected and uncorrected figures are
given. In both cases, the errors are well within the accepted limits,
particularly when the measured times can, themselves, vary to the same
order of magnitude. This is illustrated by one of the cases, which was run
twice and shows a difference of approximately 10 percent. Estimates,
[...] seem to be more accurate for narrow banded matrices, which is not
surprising since this was one of the basic assumptions made in the
operation count estimates. The adjusted run times are more accurate than
the unadjusted ones. Fig. 10 gives a pictorial representation of the
results for a set of decompositions where the bandwidth is kept at a
constant value of 270. The results give a certain amount of confidence in
the basic approach used in this paper and allow us to extend the
estimates to other algorithms.
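The estimating approach described above, in which an algorithm's run time is built up from counts of basic submatrix operations, can be sketched as follows. The per-operation times are the 40X40 figures of Table 1; the operation counts are hypothetical placeholders, since the count formulas used in the paper are not reproduced in this section, and transfer times are deliberately left out.

```python
# Basic operation times for 40X40 submatrices, in msec (Table 1).
BASIC_TIMES_MS = {
    "AP":   {"multiplication": 73.58,   "addition": 10.00, "inversion": 112.34},
    "HC16": {"multiplication": 5448.40, "addition": 65.50, "inversion": 8039.34},
    "HC32": {"multiplication": 2724.20, "addition": 32.75, "inversion": 4019.67},
}


def predict_seconds(device, counts):
    """Estimated compute time: sum of (operation count x basic time)."""
    times = BASIC_TIMES_MS[device]
    return sum(n * times[op] for op, n in counts.items()) / 1000.0


# Hypothetical operation counts, for illustration only.
counts = {"multiplication": 1000, "addition": 1000, "inversion": 10}

for device in ("HC16", "HC32", "AP"):
    print(device, round(predict_seconds(device, counts), 1))
```

A fuller estimate would add the data transfer and FCB transfer terms for the combined systems, since those are counted separately in the tables.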

FORWARD  AND  BACKWARD  SUBSTITUTIONS 

The same procedure applied so far to the stiffness assembly and
decomposition is next employed to obtain estimates of elapsed times for
the forward and backward substitutions typical of nonlinear iterations.
The results, for the same solid model described above, are given in
Table 5. They show a definite, yet small, advantage for the HC32+AP system
over the HC32. One problem seems to be the ineffectiveness of simple
additions inside the array processor. In that case, it might be
advantageous to perform this part of the computation inside the host
computer. This would speed up the process somewhat, giving a total speed
up factor of 5.33 over the HC32 alone. At this point in time, it appears
that the advantage of the array processor is not as substantial here.
However, careful measurements are required before final conclusions are
made.
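The effect of moving the additions back to the host can be checked directly from the Table 5 entries. The sketch below swaps the addition time measured inside the array processor for the HC32 addition time and leaves every other term unchanged; it assumes that the move requires no extra data transfers, which is a simplification of this example.

```python
# Table 5 entries, in seconds.
HC32_TOTAL = 1438.2      # HC32 alone
HC32_AP_TOTAL = 337.7    # HC32+AP, serial total
ADDITION_IN_AP = 83.3    # additions performed inside the array processor
ADDITION_IN_HC32 = 15.5  # the same additions performed on the HC32

# Assumed: moving the additions to the host adds no transfer time.
mixed_total = HC32_AP_TOTAL - ADDITION_IN_AP + ADDITION_IN_HC32
speed_up = HC32_TOTAL / mixed_total

print(round(mixed_total, 1))  # 269.9
print(round(speed_up, 2))     # 5.33
```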


ON  THE  ADVANTAGES  OF  PARALLEL  COMPUTATION 


In a system incorporating multiple asynchronous devices, the question
comes to mind as to the possible advantages of parallel computation.
Whether an algorithm can be accelerated by allowing the different
hardware modules to operate independently, performing different but
related functions at the same time, depends on both the algorithm and the
hardware configuration [7]. Fig. 11 shows a hypothetical case for which
parallel processing is to be applied to a specific algorithm. Three
devices are considered, a host computer, a data transfer bus, and an array
processor. Fig. 11.a illustrates the sequence of operations, in real time,
for the given system, assuming serial computation. The same process may be
represented more clearly in Fig. 11.b, where it is evident that the array
processor is not being used sufficiently. The host and data bus, on the
other hand, are used approximately half the time. The maximum benefit to
[...] actual run such as that shown in Fig. 11.d.

Whether a particular algorithm can benefit from parallel computation
or not must first be determined by a simple analysis of the demand on
computer resources. If the largest amount of time is spent in one
particular device, parallel operation will most certainly not help. It
might be possible to shift some of the load from one device to another,
[...] between computation and data transfer times. On the other hand, if
the process suffers from excessive data transfer times, data buffering,
faster data busses, along with multiple disk drives and controllers may be
used. Careful planning of system I/O operation would also help. Once it is
determined that an algorithm is potentially suitable for parallel
computation, detailed analysis in which the discrete nature of the demands
on each system component is taken into consideration, is worth undertaking
[7].
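The simple resource analysis suggested above can be sketched by comparing the sum of the device busy times (fully serial operation) with the busy time of the most heavily loaded device (an idealized bound for perfect overlap). Treating the bound as the maximum over devices is an assumption of this sketch, not a statement of the paper's method. The device times are taken from the HC32+AP column of Table 5, grouped here into bus and array processor demand.

```python
def serial_time(busy):
    """Elapsed time if the devices never overlap."""
    return sum(busy.values())


def parallel_bound(busy):
    """Idealized elapsed time with perfect overlap: the busiest device."""
    return max(busy.values())


def bottleneck(busy):
    """Name of the device that limits the overlapped run."""
    return max(busy, key=busy.get)


# HC32+AP forward and backward substitution, seconds (Table 5).
hc32_ap = {
    "host": 0.0,
    "data bus": 140.5 + 49.0,        # data transfers + FCB transfers
    "array processor": 64.9 + 83.3,  # multiplications + additions
}

print(round(serial_time(hc32_ap), 1))     # 337.7
print(round(parallel_bound(hc32_ap), 1))  # 189.5
print(bottleneck(hc32_ap))                # data bus
```

With these inputs the two results agree with the 337.7 and 189.5 second entries listed for the HC32+AP in Table 5, and the data bus is the limiting device, which is the situation the paragraph describes as calling for faster transfers instead of more overlap.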

CONCLUSIONS

The paper has succeeded in devising a methodology for predicting the
performance improvement resulting from the addition of an array processor
to a minicomputer. Four systems were considered, two of which have
attached array processors. Of the algorithms considered, it appears that
the decomposition process is the one which benefits most from the presence
of an array processor. The speed advantage gained in stiffness assembly,
and forward and backward substitutions may both be obtained by purchasing
a more powerful superminicomputer. As such, it appears that a mix of jobs,
in which the host computer is freed to handle other tasks such as pre- and
postprocessing along with linearized iterations while the array processor
is occupied in the decomposition process, might be advantageous. One might
think of alternatives to traditional computational strategies. For
example, in the modified Newton method, the host computer may be used to
improve on a current solution vector based on a previously obtained
decomposition, while the array processor is used to compute and decompose
a new stiffness matrix based on a more recent solution point. Using the
current time estimates, the advantage gained may be small. Only 6
iterations can be performed in the time required by the HC32+AP to compute
a new stiffness and decompose it. Sparse matrix computations may be used
for the iterative procedure, rather than a direct solution. It is too
early to find out whether such algorithms could be effective on an array
processor.
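The iteration budget in the modified Newton example can be estimated from the tabulated totals. The sketch below assumes that one host iteration costs one forward and backward substitution on the HC32 alone (Table 5), and that the array processor side costs the HC32+AP totals for stiffness assembly (Table 2) plus decomposition (Table 3). Which figures the paper actually combined is not stated here, so this is only one plausible reading.

```python
# Totals in seconds, taken from the "Total" rows of the tables.
ASSEMBLY_HC32_AP = 1114.0        # Table 2, HC32+AP
DECOMPOSITION_HC32_AP = 7472.7   # Table 3, HC32+AP
SUBSTITUTION_HC32 = 1438.2       # Table 5, HC32 alone

# Time the array processor needs for a new stiffness matrix and its factors.
refresh_time = ASSEMBLY_HC32_AP + DECOMPOSITION_HC32_AP

# Assumed cost of one host iteration: one forward/backward substitution.
iterations = refresh_time / SUBSTITUTION_HC32

print(round(refresh_time, 1))  # 8586.7
print(round(iterations, 2))    # 5.97
```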




One must state, however, that the apparent advantage of the array
processor in the decomposition algorithm, which seems to be the most time
consuming of all computational steps, is not to be taken lightly. It is
clear that such speeds cannot be achieved with simple computer devices
within the same price range.

Many other conclusions come to mind. A 16-bit minicomputer with
limited address space does not appear to be suitable as a host for a fast
array processor. The restricted memory size drastically limits problem
size. A conventional communication interface is not adequate if the
minicomputer/array processor combination is to be utilized effectively. In
using the hypermatrix scheme, small submatrices must be avoided. A
[...] effectively is a valid idea and will play an increasingly greater
role. Future hardware will probably take the form of a network of such
systems.

ITEM                    AP        HC16        HC32

HC-DK                            55.33        9.40
HC-AP                            45.87       16.10
Multiplication       73.58     5448.40     2724.20
Addition             10.00       65.50       32.75
Inversion           112.34     8039.34     4019.67
MRB transfer                      8.50        1.00
FCB transfer                      8.50        1.00

Table 1. TIMING OF BASIC OPERATIONS FOR 40X40 SUBMATRICES
(Times in msecs.)
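To see how transfer overhead erodes the raw speed ratio, the Table 1 figures can be combined for a single 40X40 multiplication sent to the array processor. The sketch assumes that the HC-AP entry is the cost of moving one submatrix between host and AP, that two operands go in and one result comes back, and that one FCB is needed; these are assumptions made for this example and are not stated in the table.

```python
# Table 1 entries for 40X40 submatrices, in msec.
HOST_TIMES_MS = {
    "HC16": {"multiply": 5448.40, "hc_ap": 45.87, "fcb": 8.50},
    "HC32": {"multiply": 2724.20, "hc_ap": 16.10, "fcb": 1.00},
}
AP_MULTIPLY_MS = 73.58


def effective_speed_up(host, transfers=3, fcbs=1):
    """Host multiply time divided by AP multiply time plus overhead."""
    t = HOST_TIMES_MS[host]
    ap_path = AP_MULTIPLY_MS + transfers * t["hc_ap"] + fcbs * t["fcb"]
    return t["multiply"] / ap_path


for host in ("HC16", "HC32"):
    raw = effective_speed_up(host, transfers=0, fcbs=0)
    with_overhead = effective_speed_up(host)
    print(host, round(raw, 1), round(with_overhead, 1))
```

Under these assumptions the gain for an isolated multiplication drops to roughly a third of the raw ratio on the HC16, which illustrates why keeping submatrices resident in the AP between operations matters.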


ITEM                HC16      HC32   HC16+AP  SPEED UP   HC32+AP  SPEED UP
                                                FACTOR              FACTOR

Transfers          237.7      35.7     237.7     1.000      35.7     1.000
Elmt. Comp.       7687.9    3843.9    1200.1     6.406    1078.2     3.565
Total             7925.6    3879.7    1437.8     5.512    1114.0     3.483
Parallel          7687.9    3843.9    1200.1     6.406    1078.2     3.565

NOTE: HC16 times are hypothetical due to memory restrictions.

Table 2. RELATIVE PERFORMANCE FOR TYPICAL SOLID PROBLEM
(STIFFNESS ASSEMBLY) (Times in secs.)


ITEM                HC16      HC32   HC16+AP  SPEED UP   HC32+AP  SPEED UP
                                                FACTOR              FACTOR

[...] 555952.1 [...] 1042.7 [...] 51.9 [...] 61.9   37.635
Total           562419.3  281043.3    8704.3    64.614    7472.7    37.609
Parallel        561943.8  280971.8    7339.2    76.567    7339.2    38.283

Table 3. RELATIVE PERFORMANCE FOR TYPICAL SOLID PROBLEM
(DECOMPOSITION) (Times in secs.)

[...]
 300    0.60    42    45     7     -     -
  15    16       9
 240    0.60    38    34    12     -     -
  15    10       9
 150    0.60    33    31     6     -     -

Table 4. COMPARISON OF MEASURED AND PREDICTED TIMES
(MATRIX DECOMPOSITION) (Times in secs.)


ITEM                HC16      HC32   HC16+AP  SPEED UP   HC32+AP  SPEED UP
                                                FACTOR              FACTOR

Transfers          934.3     140.5    1647.8      .567     140.5     1.000
Multns.           2564.6    1282.3      64.9    39.540      64.9    19.770
Addition            30.9      15.5      83.3      .371      83.3      .185
Inversion                                 .0                  .0
FCB transfer          .0        .0     416.7      .000      49.0      .000
TOTAL             3529.8    1438.2    2212.7     1.595     337.7     4.259
PARALLEL          2595.5    1297.8    2064.5     1.257     189.5     6.848

Table 5. RELATIVE PERFORMANCE FOR TYPICAL SOLID PROBLEM
(FORWARD AND BACKWARD SUBSTITUTIONS) (Times in secs.)


FIGURE CAPTIONS

1     Schematic of Control and Data Flow between H.C. and A.P.
2     FCB Format.
3     System with 16 Bit Host.
5.a   Data Transfer Speed (HC16-AP)
5.b   Data Transfer Speed (HC16-DK)
6     Array Processor Performance (Matrix Multiplication Time)
7     Array Processor Performance (Matrix Inverse)
8     Solid Model Example
9     Problem Size Limitation (Single Precision)
10    Matrix Decomposition on HC16+AP, Estimates vs. Measurements
11    Serial and Parallel Processing


FIG. 4 SYSTEM WITH 32 BIT HOST